TINY FLOATING POINT

Two basic disadvantages of fixed-point arithmetic are: (1) The range of numbers that can be
represented is small, e.g. with an L-bit signed two's complement fixed-point format, the smallest number is -1
and the largest is 1-2^(-L). (2) The relative error, which can be thought of as percentage error, increases
as the magnitude of the number decreases. A floating point format in general leads to increased
dynamic range and constant relative error.

This section describes a tiny IEEE-like floating point format; tiny because the total number of bits is less
than in IEEE32. The errors, the implications of trading mantissa bits for exponent bits, and the selection of a bias are
discussed in the following sections. Other non-fixed-point formats, such as compressed Z or
logarithmic, may be considered in the future; tiny IEEE is chosen as the prime candidate because of
its well-understood behavior and hardware implementation.

The discussion to follow uses L=16 in its examples because 16 bits is most likely the basic unit of transfer
between memory and other subsystems.



Relative Error

The relative error is the ratio of the absolute error, i.e. the difference between a value x and the
corresponding quantized value, to the value x.

Given L bits in a fixed-point format, the absolute error is at most (1/2)*2^(-L), so the relative error is
(1/2)*2^(-L) / |x|. In other words, the relative error of a fixed-point representation is larger when the
represented value is less than 1 than when it is greater than 1.

In contrast, a floating point format offers a constant relative error bound of (1/2)*2^(-L), where L is the number of
mantissa bits. Therefore, the absolute error of a floating point format is smaller when the represented
value is less than 1 than when it is greater than 1; a desirable property, since we want precision for
color values within the displayable range, i.e. (0,1), and are willing to trade off that precision in the extended
range where the value is greater than 1.
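The contrast between the two error behaviors can be illustrated with a minimal Python sketch. The helper names quantize_fixed and quantize_float are hypothetical. The sketch assumes a fixed-point format with 16 fractional bits and a floating point format with 10 stored mantissa bits, a hidden leading one, and round-to-nearest.

```python
import math

def quantize_fixed(x, frac_bits):
    """Round x to the nearest multiple of 2^-frac_bits (fixed point)."""
    scale = 2 ** frac_bits
    return round(x * scale) / scale

def quantize_float(x, mant_bits):
    """Round x to mant_bits stored mantissa bits plus a hidden leading one."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** (mant_bits + 1)      # hidden bit + mant_bits fraction bits
    return math.ldexp(round(m * scale) / scale, e)

def relative_error(x, q):
    """Relative error of the quantized value q with respect to x."""
    return abs(x - q) / abs(x)

if __name__ == "__main__":
    for x in (0.001, 0.01, 0.1, 0.9, 3.7):
        fx = relative_error(x, quantize_fixed(x, 16))
        fl = relative_error(x, quantize_float(x, 10))
        print(f"x={x:<6} fixed rel err={fx:.2e}  float rel err={fl:.2e}")
```

In this sketch the fixed-point absolute error never exceeds (1/2)*2^(-16), so its relative error grows as x shrinks, while the floating point relative error stays within (1/2)*2^(-10) regardless of magnitude.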



The absolute error of s10e5 is tabulated in Table 1 for each of the 32 exponent values. The error is
defined as the width of quantization: the smallest variation that can be represented at the least
significant bit of the mantissa. It is interesting to observe that for numbers close to zero, the absolute error is
2^-25, which is far better than that of a 16-bit fixed-point number.

Table 1. Absolute error of s10e5 in each exponent range.
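The per-exponent quantization widths can be approximated with a short Python sketch. It assumes a bias of 15 and that every one of the 32 exponent codes is a normalized binade with value 1.m * 2^(e-15), with no codes reserved for denormals, infinities, or NaNs. These are assumptions, not details confirmed by the text, and the function name quantization_table is hypothetical.

```python
def quantization_table(mant_bits=10, exp_bits=5, bias=15):
    """Return (exponent code, low, high, step) for each exponent code.

    Assumes each code e covers [2^(e-bias), 2^(e-bias+1)) and that the
    step is one mantissa LSB, 2^(e-bias-mant_bits).
    """
    rows = []
    for e in range(2 ** exp_bits):
        low = 2.0 ** (e - bias)
        high = 2.0 ** (e - bias + 1)
        step = 2.0 ** (e - bias - mant_bits)
        rows.append((e, low, high, step))
    return rows

if __name__ == "__main__":
    for e, low, high, step in quantization_table():
        print(f"e={e:2d}  [{low:.6g}, {high:.6g})  step=2^{e - 15 - 10}")
```

Under these assumptions the step for the values nearest zero is 2^-25, and the binade [0.25, 0.5) has a step of 2^-12.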

Trade off between exponent and mantissa

Given a fixed number of bits, the partitioning of these bits between the exponent and mantissa fields
becomes an exercise in trading off range against relative error. In general, increasing the number of exponent
bits and decreasing the number of mantissa bits increases range at the expense of increased relative error.
In Plot 1, the x-axis is the 16-bit number interpreted as an unsigned integer and the y-axis is the same 16-bit
number interpreted as floating point. Observe that when a bit is moved from the mantissa field to the exponent field,
i.e. s10e5 to s9e6 or s11e4 to s10e5, the range is extended on both sides: the largest representable
number is 2^16 times larger and the smallest representable number is 2^16 times smaller.
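The range extension can be checked with a small Python sketch. The function name representable_range is hypothetical. The sketch assumes a bias of 2^(E-1) - 1 for an E-bit exponent (15 for s10e5, 31 for s9e6) and that all exponent codes are normalized, with none reserved for denormals or infinities.

```python
def representable_range(mant_bits, exp_bits, bias=None):
    """Return (smallest, largest) positive values of an sMeE format.

    Assumes bias = 2^(exp_bits-1) - 1 unless given, and that all exponent
    codes are normalized (none reserved for denormals or infinities).
    """
    if bias is None:
        bias = 2 ** (exp_bits - 1) - 1
    max_code = 2 ** exp_bits - 1
    smallest = 2.0 ** (0 - bias)                                   # code 0, mantissa 0
    largest = (2.0 - 2.0 ** -mant_bits) * 2.0 ** (max_code - bias)  # all ones
    return smallest, largest

if __name__ == "__main__":
    for name, m, e in (("s10e5", 10, 5), ("s9e6", 9, 6)):
        lo, hi = representable_range(m, e)
        print(f"{name}: smallest={lo:.3e} largest={hi:.3e}")
```

With these assumptions, going from s10e5 to s9e6 scales the smallest value by exactly 2^-16 and the largest by very nearly 2^16. Under the same assumed bias rule, the step from s11e4 to s10e5 changes each end of the range by 2^8 rather than 2^16.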




Where do the numbers in the extended range come from? Histogram 1 gives a picture of how the
numbers are redistributed when mantissa bits are traded for exponent bits. In Histogram 1, the bins are
logarithmic in size, i.e. bin 0 is (0,1), bin 1 is (1,2), bin 2 is (2,4), etc. Note that for all three formats in the
histogram, the size of bin 0 remains the same. As we go from s10e5 to s9e6, the number of values
in (0,1) remains the same, and some of the values that were above 1.0 are redistributed to
the new extended range.

Also note that for s10e5, about half the numbers are within the (-1, +1) range.

Histogram 1
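The bin counts behind such a histogram can be reproduced with a minimal Python sketch. The function name log_bin_counts is hypothetical, and the biases used (7, 15, and 31 for s11e4, s10e5, and s9e6) are assumptions, as is the treatment of every exponent code as a normalized binade. Only positive values are counted.

```python
def log_bin_counts(mant_bits, exp_bits, bias):
    """Count positive values per logarithmic bin.

    Bin 0 is (0, 1); bin k (k >= 1) is [2^(k-1), 2^k). Assumes every
    exponent code is a normalized binade holding 2^mant_bits values.
    """
    per_code = 2 ** mant_bits
    codes = 2 ** exp_bits
    below_one = min(bias, codes)      # codes 0 .. bias-1 lie below 1.0
    return [below_one * per_code] + [per_code] * (codes - below_one)

if __name__ == "__main__":
    formats = (("s11e4", 11, 4, 7), ("s10e5", 10, 5, 15), ("s9e6", 9, 6, 31))
    for name, m, e, bias in formats:
        counts = log_bin_counts(m, e, bias)
        print(f"{name}: bin0={counts[0]}  other bins={len(counts) - 1} x {counts[1]}")
```

Under these assumed biases, bin 0 holds 14336, 15360, and 15872 values for the three formats, so it is nearly but not exactly constant, and for s10e5 about 47% of the values have magnitude below 1.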



Choosing a bias 

Changing the bias also has an effect on range, but not on relative error. As shown in Plot 2,
changing the bias from 15 to 14 doubles both the largest and the smallest representable numbers.

Histogram 2 shows how the numbers are redistributed when the bias is changed. As the bias is changed
from 17 to 12, the number of values in bin 0 decreases as the range increases, while the sizes of all
other bins remain constant. Effectively, the range is extended at the expense of having fewer values
between (0.0, 1.0).
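The effect of the bias on range and on bin 0 can be sketched in a few lines of Python. The function name bias_effect is hypothetical, and the sketch again assumes that all 32 exponent codes of s10e5 are normalized binades.

```python
def bias_effect(bias, mant_bits=10, exp_bits=5):
    """Return (smallest, largest, bin0_count) for a given exponent bias."""
    max_code = 2 ** exp_bits - 1
    smallest = 2.0 ** (-bias)                                       # code 0, mantissa 0
    largest = (2.0 - 2.0 ** -mant_bits) * 2.0 ** (max_code - bias)  # all ones
    bin0_count = bias * 2 ** mant_bits    # positive values in (0, 1)
    return smallest, largest, bin0_count

if __name__ == "__main__":
    for bias in range(17, 11, -1):
        lo, hi, n0 = bias_effect(bias)
        print(f"bias={bias}: smallest={lo:.3e} largest={hi:.3e} bin0={n0}")
```

Each decrement of the bias doubles both ends of the range and removes one binade (1024 values) from bin 0, while the mantissa width, and therefore the relative error, is unchanged.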

12 bit fixed versus S10E5

When using S10E5 as the canonical intermediate format in the rchip, one desirable property is to
preserve 12-bit fixed accuracy when we convert from 12-bit fixed to S10E5 and back to 12-bit fixed.

From Table 1, the absolute error in the range (-.499875, .499875) corresponds to 12 bits of precision
or better. So we can linearly map (0, 4095) 12-bit fixed to (-.499875, .499875) and back to (0,
4095) without losing any bits. A simple program using arith.c verifies that this is true.
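The arith.c program is not reproduced here. As a stand-in, the following Python sketch performs the same kind of round-trip check. It assumes that IEEE binary16 (Python's struct format "e", round-to-nearest) behaves like S10E5 over this range, and the helper names are hypothetical.

```python
import struct

LIMIT = 0.499875                 # endpoint quoted in the text
STEP = 2 * LIMIT / 4095          # spacing of the 4096 mapped values

def to_half(x):
    """Round x to the nearest IEEE binary16 value (stand-in for S10E5)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fixed12_to_float(k):
    """Linearly map a 12-bit code 0..4095 onto [-LIMIT, +LIMIT]."""
    return -LIMIT + k * STEP

def float_to_fixed12(x):
    """Map a value in [-LIMIT, +LIMIT] back to the nearest 12-bit code."""
    return round((x + LIMIT) / STEP)

def round_trip_ok():
    """True if every 12-bit code survives fixed -> half -> fixed."""
    return all(float_to_fixed12(to_half(fixed12_to_float(k))) == k
               for k in range(4096))

if __name__ == "__main__":
    print("all 4096 codes survive:", round_trip_ok())
```

The check works because the coarsest quantization step in this range, 2^-12 for magnitudes in [0.25, 0.5), is about the same size as the spacing between adjacent 12-bit codes after the linear mapping.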